1 Preparing Data for Tables
Now that we’ve covered the main types of variables, we can start thinking about how to prepare and visually represent data with figures and tables. In this section, we will focus on preparing data for tables.
In R, datasets are stored as dataframes, which are objects made up of rows and columns. Each row corresponds to an individual observation, while each column represents the variable that is being measured. There are several ways R stores variables: character for text, factor for categorical, integer for whole numbers, numeric for decimals, and logical for booleans (TRUE/FALSE).
Cleaning data can be the most time-consuming, yet arguably the most important, part of analysing data! Generally, it is
- wide format to long format
To understand R, we will examine these concepts using real-world datasets.
NAPLAN
The first dataset to be explored is the NAPLAN dataset. NAPLAN stands for the National Assessment Program - Literacy and Numeracy, and is an annual national assessment for all students in year 3, 5, 7, and 9. The assessment is designed to evaluate whether students are building essential literacy and numeracy skills (https://www.nap.edu.au/naplan/faqs/naplan–general).
To load in the dataset, we can use the read.csv() function.
naplan <- read.csv("Data-sets/naplan_reading.csv")Try doing this yourself.
naplan <- read.csv("Data-sets/naplan_reading.csv")
naplan <- read.csv("Data-sets/naplan_reading.csv")Now that we have the dataset loaded in, let’s take a look at the variables in the dataset using str().
str(naplan)
str(naplan)In the top line, we can see that there are 3000 observations and 11 variables.
We can confirm this using the dim() function
dim(naplan)
dim(naplan)Using the names() function, we can see the variable names.
names(naplan)
names(naplan)The first column gives us the variables, the second column gives us the type of variable, and the third column gives us individual observations for each row. However, we can see school_type is stored as a chr or character (text) variable. Text variables are very common when you encounter surveys that have customer feedback, for example, as each response will be unique from one another.
It is important to check that your variables are stored correctly. We can see certain variables like reading time at home is an integer, which matches our expectations.
You can use the unique() function to see the unique groupings or labels within a categorical variable. In our case, we can use it to confirm if our suspicions that school_type is categorical. To call a specific variable, we call the dataset followed by $ and then the variable name.
unique(naplan$school_type)[1] "Government" "Independent" "Catholic"
There are only 3 unique labels, which strongly suggests that school type is a nominal categorical variable. If it were truly a text variable, we might see thousands of unique labels.
We can manually convert the variable to categorical using as.factor():
naplan$school_type <- as.factor(naplan$school_type)str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : int 51 26 14 21 33 26 37 41 8 60 ...
$ grade : chr "Year 7" "Year 3" "Year 7" "Year 3" ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : chr "Certificate/Diploma" "Year 12" "Bachelor degree" "Bachelor degree" ...
$ school_type : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
$ gender : chr "Female" "Female" "Male" "Female" ...
$ birth_months : chr "Aug" "Apr" "Jul" "Sep" ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
Now we can see school_type is a factor with 3 levels (or 3 unique labels).
Alternatively, we can also convert all strings (text) to categorical variables with the stringsAsFactors argument when we read in the dataset.
naplan <- read.csv("Data-sets/naplan_reading.csv",
stringsAsFactors = TRUE)str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : int 51 26 14 21 33 26 37 41 8 60 ...
$ grade : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
$ school_type : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
$ birth_months : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
As seen above, all the chr variables have been converted to Factor, which is how the variables should be stored. Note that there might be times where you need a variable stored as chr, so you should always be sure of what you’re checking.
We can now attach the dataset. Attaching the dataset means we can call variables directly without using $. For instance, we can call school_type directly as opposed to naplan$school_type.
attach(naplan)If we type naplan directly in, we can see the content.
naplanAlternatively, we can use the View() function to open up a new tab in RStudio with the dataset.
View(naplan)We can also select parts of the dataframe. This can be very useful for deciding which data you want to include a table. In R, we subset by calling the dataset followed by [,] . Writing a blank space before the comma indicates that all rows are includes, while a blank space after the comma indicates that all columns are included. For example, [,3] selects the third column of a dataframe. If we want to select the first three columns, we can write:
Try selecting the first three rows and first three columns of the naplan dataset. You select a number of rows or columns using :. For example, to select rows 3-4, simply type naplan[3:4,]
naplan[1:3,1:3]
naplan[1:3,1:3]Now try selecting only rows 5-6.
naplan[5:6,]
naplan[5:6,]We can even select rows based on logical tests on the values of one or more variables.
For instance, let’s say we want all observations where reading time is greater than 45 mins. We can do this by typing naplan[reading_time_home>45, ].
Try having a go below by selecting rows with reading times greater than 30 mins AND where students havea reading score greater than 800. Hint; you will need to use the logical operator &
naplan[reading_time_home > 30 & naplan_reading_score > 860,]
naplan[reading_time_home > 30 & naplan_reading_score > 860,]The dyplr package can simplify this process. Some helpful functions include select, mutate, and filter. You can load in tidyverse which has dplyr built in, or load in dyplr directly.
library(tidyverse)Let’s filter our data by Year 3 students with reading times at home above 100 mins.
naplan %>%
filter(grade == "Year 3" & reading_time_home > 100) student_id school_id grade reading_time_home parent_education school_type
1 976 3 Year 3 120 Bachelor degree Government
2 1005 44 Year 3 120 Bachelor degree Independent
3 1512 38 Year 3 115 Year 10 or below Government
4 2378 34 Year 3 115 Year 12 Catholic
5 2473 52 Year 3 105 Bachelor degree Government
gender birth_months n_siblings ses_index naplan_reading_score
1 Female Aug 3 0.24 539
2 Male Mar 2 2.20 600
3 Male Aug 3 -1.21 533
4 Female Jul 3 0.07 319
5 Female Sep 1 1.02 455
As we saw above, student_id and school_id is stored as integers. As these numbers actually represent unique labels, they should be treated as categorical. We can mutate student_id and school_id to factor variables.
naplan <- naplan %>%
mutate(across(c(school_id, student_id), as.factor))str(naplan)'data.frame': 3000 obs. of 11 variables:
$ student_id : Factor w/ 3000 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ school_id : Factor w/ 60 levels "1","2","3","4",..: 51 26 14 21 33 26 37 41 8 60 ...
$ grade : Factor w/ 4 levels "Year 3","Year 5",..: 3 1 3 1 4 4 1 3 4 3 ...
$ reading_time_home : int 70 70 30 0 20 25 5 90 45 10 ...
$ parent_education : Factor w/ 5 levels "Bachelor degree",..: 2 5 1 1 5 4 4 1 4 4 ...
$ school_type : Factor w/ 3 levels "Catholic","Government",..: 2 2 2 3 2 2 2 2 2 2 ...
$ gender : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 1 1 1 2 ...
$ birth_months : Factor w/ 12 levels "Apr","Aug","Dec",..: 2 1 6 12 3 10 2 12 11 8 ...
$ n_siblings : int 2 1 1 1 0 1 3 2 1 3 ...
$ ses_index : num 0.87 0.11 1.14 0.66 0.27 -0.95 -0.65 1.01 -0.6 -1.14 ...
$ naplan_reading_score: int 724 439 631 491 635 592 449 484 730 528 ...
We can combine several of these pipelines together. For example, we can select student_id and school_id, rename student_id to student and school_id to school, then mutate them back to integers and store the result in a new dataset called naplan_new.
naplan_new <- naplan %>%
select(student = student_id,
school = school_id) %>%
mutate(across(c(school, student), as.integer))
str(naplan_new)'data.frame': 3000 obs. of 2 variables:
$ student: int 1 2 3 4 5 6 7 8 9 10 ...
$ school : int 51 26 14 21 33 26 37 41 8 60 ...
We can summarise the content of the dataframe as well.
summary(naplan) student_id school_id grade reading_time_home
1 : 1 38 : 87 Year 3:735 Min. : 0.00
2 : 1 20 : 83 Year 5:771 1st Qu.: 5.00
3 : 1 41 : 83 Year 7:745 Median : 15.00
4 : 1 26 : 81 Year 9:749 Mean : 21.62
5 : 1 60 : 81 3rd Qu.: 30.00
6 : 1 48 : 80 Max. :120.00
(Other):2994 (Other):2505
parent_education school_type gender birth_months
Bachelor degree :771 Catholic : 540 Female:1451 Dec : 325
Certificate/Diploma:701 Government :2023 Male :1549 Aug : 314
Postgraduate :297 Independent: 437 Jan : 300
Year 10 or below :466 Sep : 299
Year 12 :765 Jul : 296
Nov : 296
(Other):1170
n_siblings ses_index naplan_reading_score
Min. :0.000 Min. :-2.3200 Min. :177.0
1st Qu.:1.000 1st Qu.:-0.1100 1st Qu.:491.0
Median :2.000 Median : 0.5300 Median :562.0
Mean :1.899 Mean : 0.5246 Mean :558.5
3rd Qu.:3.000 3rd Qu.: 1.1500 3rd Qu.:618.2
Max. :5.000 Max. : 3.4100 Max. :900.0
You might notice that the summary doesn’t quite work for categorical variables.
We can use tapply() and with() to summarise a quantitative variable by a qualitative variable.
with(naplan, tapply(naplan_reading_score, grade, mean)) Year 3 Year 5 Year 7 Year 9
457.1687 539.7821 594.5785 641.1722
aggregate() allows you to summarise by one or more categorical variables, while tapply can only summarise one variable at a time.
aggregate(naplan[,c(4,11)], list(grade),mean) Group.1 reading_time_home naplan_reading_score
1 Year 3 21.59184 457.1687
2 Year 5 21.83528 539.7821
3 Year 7 22.24832 594.5785
4 Year 9 20.80774 641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade),mean) Grade reading_time_home naplan_reading_score
1 Year 3 21.59184 457.1687
2 Year 5 21.83528 539.7821
3 Year 7 22.24832 594.5785
4 Year 9 20.80774 641.1722
aggregate(naplan[,c(4,11)], list(Grade = grade, "Parent Education" = parent_education),mean) Grade Parent Education reading_time_home naplan_reading_score
1 Year 3 Bachelor degree 21.35000 478.5500
2 Year 5 Bachelor degree 19.63054 564.7783
3 Year 7 Bachelor degree 21.15591 618.3065
4 Year 9 Bachelor degree 18.48901 653.3187
5 Year 3 Certificate/Diploma 18.17919 457.6127
6 Year 5 Certificate/Diploma 21.78977 536.1705
7 Year 7 Certificate/Diploma 23.15217 595.2554
8 Year 9 Certificate/Diploma 22.44048 642.1786
9 Year 3 Postgraduate 20.06494 487.3117
10 Year 5 Postgraduate 20.19481 575.4026
11 Year 7 Postgraduate 23.10606 628.7727
12 Year 9 Postgraduate 17.20779 684.5325
13 Year 3 Year 10 or below 22.47475 431.2323
14 Year 5 Year 10 or below 25.49587 505.1983
15 Year 7 Year 10 or below 21.30435 558.1391
16 Year 9 Year 10 or below 21.48855 609.4809
17 Year 3 Year 12 25.18817 435.0914
18 Year 5 Year 12 22.55155 524.3351
19 Year 7 Year 12 22.70619 581.1546
20 Year 9 Year 12 22.56545 632.9686
Exercise: Palmer Penguins
Photo by Derek Oyen on Unsplash
First, we will need to load in some data into R. We will be working with the palmerpenguins package. The palmerpenguins package is a dataset collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
If you have not used this package before, you will need to install it first.
install.packages("palmerpenguins")Once installed, you will need to load the package into R.
# Load in the `palmerpenguins` package
library(palmerpenguins)
# Load in the `palmerpenguins` package
library(palmerpenguins)Now that we have the package installed, let’s take a look at the variables in the dataset using str() on the penguins dataset.
# Call str() on `penguins`
str(penguins)
# Call str() on `penguins`
str(penguins)You might notice that some values are listed as NA. This means there are missing observations. To check for missing values, try using the embedded any(is.na()) function.
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))
# Try using any(is.na()) on the `penguins` dataset.
any(is.na(penguins))Since this returns TRUE, we know there are missing values in the dataset. Let’s find out how many.
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))
# Try using sum(is.na()) on the `penguins` dataset.
sum(is.na(penguins))There are missing observations. This can cause problems when generating numerical summaries or running statistical tests. At this stage, to handle this, we can remove the missing values. It’s always best practice to store your cleaned data in a new dataset and leave the original untouched. Note that you should always be careful when removing observations, as even partially filled observations can still provide valuable information.
clean_penguins <- na.omit(penguins)
clean_penguins <- na.omit(penguins)We can check the structure again to make sure everything looks good:
str(clean_penguins)tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
$ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
$ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
$ body_mass_g : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
$ year : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
We can see that there are eight variables. Three of these are categorical: species, island, and sex. These are stored as factors, where the levels represent the number of distinct labels or groups within each variable. For example, sex has two levels: male and female.
You’ll notice that there are four numerical variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and year. If your instinct is that year doesn’t quite fit as right as a numerical variable, you would be right! It is better to treat year as a categorical variable.
How do we manually convert year to a factor?
clean_penguins$year <- as.factor(clean_penguins$year)
clean_penguins$year <- as.factor(clean_penguins$year)Now if we run class(), we can see that year is now correctly stored as a categorical variable with three levels.
class(clean_penguins$year)[1] "factor"
How can we check the unique groupings or labels within year?
unique(clean_penguins$year)
unique(clean_penguins$year)This shows that the data was collected over the years , , and (ascending order).
You can also see from the structure output str() that the penguins dataset is stored as a 333 x 8 tibble, which is a type of data frame in R. This means that there are 333 rows (observations) and 8 columns (variables).
Check the dimensions of the cleaned dataset to confirm this.
dim(clean_penguins)
dim(clean_penguins)If you wish to have a closer look at inspecting the penguins dataset, you can use the View() function to open it in a new tab on RStudio.
View(penguins)
View(clean_penguins)Tables
Now that we’ve inspected the dataset, we can start thinking about how to visually represent and understand the data.
As we have learnt
library(gt)
library(tidyverse)
library(kableExtra)naplan %>%
dplyr::filter(reading_time_home > 30 & naplan_reading_score > 800) %>%
gt() %>%
tab_header(title = "Naplan") %>%
fmt_number(decimals=2) %>%
cols_label(
student_id = md("***Student ID***"),
school_id = md("***School ID***"),
grade = md("***Grade***"),
reading_time_home = md("***Reading Time***"),
parent_education = md("***Parent Education***"),
school_type = md("***School Type***"),
gender = md("***Gender***"),
birth_months = md("***Birth Month***"),
n_siblings = md("***Number of Siblings***"),
ses_index = md("***SES Index***"),
naplan_reading_score = md("***Reading Score***")
)| Naplan | ||||||||||
Student ID |
School ID |
Grade |
Reading Time |
Parent Education |
School Type |
Gender |
Birth Month |
Number of Siblings |
SES Index |
Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 493 | 1 | Year 9 | 65.00 | Year 12 | Independent | Male | Aug | 0.00 | 1.05 | 900.00 |
| 584 | 1 | Year 9 | 40.00 | Certificate/Diploma | Independent | Female | Sep | 1.00 | 0.68 | 900.00 |
| 626 | 35 | Year 9 | 40.00 | Bachelor degree | Government | Female | Dec | 2.00 | 1.16 | 804.00 |
| 741 | 42 | Year 9 | 105.00 | Year 12 | Government | Male | Aug | 5.00 | −0.44 | 900.00 |
| 1118 | 42 | Year 9 | 95.00 | Certificate/Diploma | Government | Male | Jul | 2.00 | −0.28 | 883.00 |
| 1168 | 31 | Year 9 | 60.00 | Postgraduate | Catholic | Female | Aug | 2.00 | 2.19 | 871.00 |
| 1174 | 35 | Year 9 | 65.00 | Year 12 | Government | Male | Aug | 2.00 | −0.33 | 900.00 |
| 1447 | 23 | Year 9 | 75.00 | Year 12 | Independent | Male | Sep | 2.00 | 0.30 | 900.00 |
| 1792 | 5 | Year 9 | 70.00 | Year 12 | Government | Male | Aug | 1.00 | 0.52 | 900.00 |
| 1906 | 5 | Year 9 | 70.00 | Year 10 or below | Government | Male | Jan | 5.00 | −0.71 | 887.00 |
| 1952 | 35 | Year 9 | 75.00 | Year 12 | Government | Male | Aug | 1.00 | −0.17 | 900.00 |
| 2122 | 5 | Year 9 | 55.00 | Postgraduate | Government | Male | Jul | 3.00 | 1.87 | 900.00 |
| 2359 | 35 | Year 9 | 60.00 | Bachelor degree | Government | Female | Dec | 5.00 | 1.33 | 900.00 |
| 2425 | 47 | Year 9 | 50.00 | Postgraduate | Independent | Female | Aug | 2.00 | 2.02 | 866.00 |
| 2691 | 55 | Year 9 | 80.00 | Year 12 | Government | Female | Dec | 2.00 | −0.18 | 837.00 |
| 2712 | 31 | Year 9 | 60.00 | Year 12 | Catholic | Male | Dec | 2.00 | 0.81 | 809.00 |
| 2756 | 53 | Year 9 | 55.00 | Year 10 or below | Government | Female | Mar | 1.00 | −0.40 | 840.00 |
naplan %>%
filter(gender == "Female", school_type == "Independent") %>%
arrange(desc(naplan_reading_score)) %>%
slice_head(n = 10) %>%
gt() %>%
tab_header(title = "Naplan") %>%
fmt_number(decimals = 2) %>%
cols_label(
student_id = md("***Student ID***"),
school_id = md("***School ID***"),
grade = md("***Grade***"),
reading_time_home = md("***Reading Time***"),
parent_education = md("***Parent Education***"),
school_type = md("***School Type***"),
gender = md("***Gender***"),
birth_months = md("***Birth Month***"),
n_siblings = md("***Number of Siblings***"),
ses_index = md("***SES Index***"),
naplan_reading_score = md("***Reading Score***")
) %>%
data_color(
columns = naplan_reading_score,
method = "numeric",
palette = "viridis",
domain = c(0, 1000),
reverse = TRUE
) %>%
tab_style(
style = list(
cell_fill(color = "gray95"),
cell_borders(sides = c("l", "r"), color = "gray50", weight = px(3))
),
locations = cells_body(columns = -naplan_reading_score)
) %>%
tab_style(
style = cell_fill(color = "gray98"),
locations = cells_title()
) %>%
tab_style(
style = list(cell_fill(color = "gray35"), cell_text(color = "white")),
locations = list(cells_footnotes(), cells_source_notes())
) %>%
tab_style(
style = cell_text(weight = "bold"),
locations = cells_title(groups = "title")
) %>%
tab_style(
style = cell_borders(
sides = c("t", "b"),
color = "darkgrey",
weight = px(3)
),
locations = list(cells_column_labels(), cells_stubhead())
) %>%
tab_footnote(
footnote = "SES Index represents ",
locations = cells_column_labels(columns = ses_index)
)| Naplan | ||||||||||
Student ID |
School ID |
Grade |
Reading Time |
Parent Education |
School Type |
Gender |
Birth Month |
Number of Siblings |
SES Index 1 |
Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 584 | 1 | Year 9 | 40.00 | Certificate/Diploma | Independent | Female | Sep | 1.00 | 0.68 | 900.00 |
| 2425 | 47 | Year 9 | 50.00 | Postgraduate | Independent | Female | Aug | 2.00 | 2.02 | 866.00 |
| 1007 | 1 | Year 9 | 25.00 | Bachelor degree | Independent | Female | Apr | 0.00 | 1.09 | 831.00 |
| 887 | 1 | Year 7 | 50.00 | Bachelor degree | Independent | Female | Jul | 1.00 | 0.85 | 800.00 |
| 2166 | 2 | Year 7 | 60.00 | Bachelor degree | Independent | Female | May | 1.00 | 2.40 | 800.00 |
| 2610 | 23 | Year 9 | 35.00 | Bachelor degree | Independent | Female | Oct | 1.00 | 1.44 | 796.00 |
| 305 | 1 | Year 9 | 35.00 | Certificate/Diploma | Independent | Female | Apr | 3.00 | 0.46 | 788.00 |
| 1220 | 47 | Year 9 | 35.00 | Certificate/Diploma | Independent | Female | Jul | 5.00 | 0.76 | 782.00 |
| 133 | 47 | Year 9 | 55.00 | Year 10 or below | Independent | Female | Sep | 3.00 | 0.36 | 780.00 |
| 1383 | 47 | Year 9 | 30.00 | Certificate/Diploma | Independent | Female | Sep | 2.00 | 1.53 | 774.00 |
| 1 SES Index represents | ||||||||||
naplan |>
dplyr::group_by(gender) |>
dplyr::summarise(
mean_score = mean(naplan_reading_score, na.rm = TRUE),
mean_ses = mean(ses_index, na.rm = TRUE),
.groups = "drop"
) |>
tidyr::pivot_wider(
names_from = gender,
values_from = c(mean_score, mean_ses)
) |>
gt() |>
fmt_number(columns = everything(), decimals = 2) |>
tab_spanner(
label = "Female",
columns = matches("_Female$")
) |>
tab_spanner(
label = "Male",
columns = matches("_Male$")
) |>
cols_label(
matches("mean_score") ~ "Reading Score<br>Average",
matches("mean_ses") ~ "SES Index<br>Average",
.fn=md
) |>
cols_width(everything() ~ px(120))| Female | Male | ||
|---|---|---|---|
Reading Score |
SES Index |
Reading Score |
SES Index |
| 563.00 | 0.50 | 554.21 | 0.55 |
library(gtsummary)naplan |>
tbl_summary(
by = gender,
include=c(grade, reading_time_home, naplan_reading_score),
statistic=list(all_continuous()~"{mean} ({sd})"),
digits=list(naplan_reading_score = c(0,1),
reading_time_home = c(0,1))
)Characteristic |
Female |
Male |
|---|---|---|
| grade | ||
| Year 3 | 350 (24%) | 385 (25%) |
| Year 5 | 378 (26%) | 393 (25%) |
| Year 7 | 356 (25%) | 389 (25%) |
| Year 9 | 367 (25%) | 382 (25%) |
| reading_time_home | 22 (22.9) | 21 (22.7) |
| naplan_reading_score | 563 (95.7) | 554 (92.0) |
| 1
n (%); Mean (SD) |
||
naplan |>
tbl_summary(
include=c(reading_time_home, naplan_reading_score),
type=all_continuous() ~ "continuous2",
statistic=list(all_continuous()~c("{median} ({p25}, {p75})", "{min}, {max}"))
)Characteristic |
N = 3,000 |
|---|---|
| reading_time_home | |
| Median (Q1, Q3) | 15 (5, 30) |
| Min, Max | 0, 120 |
| naplan_reading_score | |
| Median (Q1, Q3) | 562 (491, 619) |
| Min, Max | 177, 900 |
naplan %>%
filter(gender == "Female", school_type == "Independent") %>%
arrange(desc(naplan_reading_score)) %>%
slice_head(n = 10) %>%
select(
`Student ID` = student_id,
`School ID` = school_id,
Grade = grade,
`Reading Time` = reading_time_home,
`Parent Education` = parent_education,
`School Type` = school_type,
Gender = gender,
`Birth Month` = birth_months,
`Number of Siblings` = n_siblings,
`SES Index` = ses_index,
`Reading Score` = naplan_reading_score
) %>%
mutate(
`Reading Score` = cell_spec(
`Reading Score`,
background = spec_color(`Reading Score`, option = "viridis", end = 0.95),
color = "white"
)
) %>%
kable(
format = "html",
align = "c",
booktabs = TRUE,
escape = FALSE,
caption = "Naplan"
) %>%
kable_styling(full_width = FALSE) %>%
row_spec(0, bold = TRUE, color = "white", background = "gray35") %>%
column_spec(
1:10,
background = "gray95",
extra_css = "border-left: 3px solid gray50; border-right: 3px solid gray50;"
) %>% footnote(general = "SES Index represents ...", general_title = "")| Student ID | School ID | Grade | Reading Time | Parent Education | School Type | Gender | Birth Month | Number of Siblings | SES Index | Reading Score |
|---|---|---|---|---|---|---|---|---|---|---|
| 584 | 1 | Year 9 | 40 | Certificate/Diploma | Independent | Female | Sep | 1 | 0.68 | 900 |
| 2425 | 47 | Year 9 | 50 | Postgraduate | Independent | Female | Aug | 2 | 2.02 | 866 |
| 1007 | 1 | Year 9 | 25 | Bachelor degree | Independent | Female | Apr | 0 | 1.09 | 831 |
| 887 | 1 | Year 7 | 50 | Bachelor degree | Independent | Female | Jul | 1 | 0.85 | 800 |
| 2166 | 2 | Year 7 | 60 | Bachelor degree | Independent | Female | May | 1 | 2.40 | 800 |
| 2610 | 23 | Year 9 | 35 | Bachelor degree | Independent | Female | Oct | 1 | 1.44 | 796 |
| 305 | 1 | Year 9 | 35 | Certificate/Diploma | Independent | Female | Apr | 3 | 0.46 | 788 |
| 1220 | 47 | Year 9 | 35 | Certificate/Diploma | Independent | Female | Jul | 5 | 0.76 | 782 |
| 133 | 47 | Year 9 | 55 | Year 10 or below | Independent | Female | Sep | 3 | 0.36 | 780 |
| 1383 | 47 | Year 9 | 30 | Certificate/Diploma | Independent | Female | Sep | 2 | 1.53 | 774 |
| SES Index represents ... |